Project

Wrangle and Analyze Data

    By Hailu Teju, August 2018


Data

The data for this project are obtained from the WeRateDogs Twitter archive in one form or another. There are three pieces of the data that come in different forms:

  1. twitter_archive_enhanced.csv file will be manually downloaded from Udacity's course web page
  2. image_prediction.tsv file will be downloaded programatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
  3. Additional data containing each tweet's retweet count and favorite ("like") count will be obtained by querying the Twitter API using the tweet IDs in the WeRateDogs Twitter archive that are given in the above two files and Python's Tweepy library. The JSON data obtained will be stored in a file called tweet_json.txt on a local machine

Tasks in this project

  • Data wrangling, which consists of:

    • Gathering data
    • Assessing data
    • Cleaning data
  • Storing, analyzing, and visualizing your wrangled data

  • Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

Project Details

Assessing Data for this Project

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

Cleaning Data for this Project

Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

Storing, Analyzing, and Visualizing Data for this Project

Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

Reporting for this Project

Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.



Here it begins ...


I will start by loading all the libraries that I may need
In [1]:
import numpy as np
import pandas as pd
import os
import re
import time
import requests
import tweepy
import json
import matplotlib.pyplot as plt
%matplotlib inline

Gather Data

Gathering data source 1
In [112]:
# Create folder for the data sets
folder_name = 'we_rate_dogs'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
In [113]:
# Read twitter-archive-enhanced.csv and image-predictions.tsv files
folder_name = 'we_rate_dogs'
twitter_archive_enhanced = pd.read_csv(os.path.join(folder_name, 'twitter-archive-enhanced.csv'))
Gathering data source 2
In [114]:
# URL for the image-prediction.tsv file
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions\
/image-predictions.tsv"
 
folder_name = 'we_rate_dogs'

response = requests.get(url)

# Write image-predictions file to folder
filename = os.path.join(folder_name, url.split('/')[-1])
with open(filename, mode='wb') as file:
    file.write(response.content)
In [115]:
folder_name = 'we_rate_dogs'

image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')
Gathering data source 3

Since I have already gotten timestamp and image-url (jpg_url) information in the two files above, I don't need to store these information again in the below file obtained from tweet JSON data.

In [5]:
tweet_ids = twitter_archive_enhanced['tweet_id'].astype(str)

# Query data from the WeRateDog Twitter archive using the Tweepy library

# consumer_key = ************************
# consumer_secret = ***********************************
# access_token = ***************************************
# access_secret = ************************************

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify=True)

start = time.time()

data = {}
data['dog_rating'] = []
tweet_errors = []

for tweet_id in tweet_ids:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        data['dog_rating'].append({
            'tweet_id': int(tweet.id),
            #'created_at': tweet.created_at,
            #'img_url': tweet._json['entities']['media'][0]['media_url_https'],
            'retweet_count': tweet.retweet_count,
            'favorite_count': tweet.favorite_count
            })
        
    except Exception as e:
        print(tweet_id + " - " + str(e))
        tweet_errors.append(tweet_id)
        
end = time.time()
print('Elapsed time:', end - start)
888202515573088257 - [{'code': 144, 'message': 'No status found with that ID.'}]
873697596434513921 - [{'code': 144, 'message': 'No status found with that ID.'}]
869988702071779329 - [{'code': 144, 'message': 'No status found with that ID.'}]
866816280283807744 - [{'code': 144, 'message': 'No status found with that ID.'}]
861769973181624320 - [{'code': 144, 'message': 'No status found with that ID.'}]
845459076796616705 - [{'code': 144, 'message': 'No status found with that ID.'}]
842892208864923648 - [{'code': 144, 'message': 'No status found with that ID.'}]
837012587749474308 - [{'code': 144, 'message': 'No status found with that ID.'}]
827228250799742977 - [{'code': 144, 'message': 'No status found with that ID.'}]
802247111496568832 - [{'code': 144, 'message': 'No status found with that ID.'}]
775096608509886464 - [{'code': 144, 'message': 'No status found with that ID.'}]
770743923962707968 - [{'code': 144, 'message': 'No status found with that ID.'}]
Rate limit reached. Sleeping for: 658
758740312047005698 - Failed to send request: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')",))
754011816964026368 - [{'code': 144, 'message': 'No status found with that ID.'}]
Rate limit reached. Sleeping for: 653
676957860086095872 - Failed to send request: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')",))
Elapsed time: 2005.9169504642487

Acknowledgement: The helpful definition for serializing the datetime object tweet.created_at that was not serializable by the default json code is obtained from a Stack Overflow solution.

In [6]:
from datetime import date, datetime

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""
    
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError ("Type %s not serializable" % type(obj))
    
In [7]:
folder_name = 'we_rate_dogs'
file_name = os.path.join(folder_name, 'tweet_json.txt')

with open(file_name, 'w') as outfile:
    json.dump(data, outfile, default=json_serial)
In [117]:
folder_name = 'we_rate_dogs'
file_name = os.path.join(folder_name, 'tweet_json.txt')

with open ('tweet_json_img.txt') as json_file:
    data = json.load(json_file)
In [118]:
tweet_json = pd.DataFrame(data['dog_rating'], 
                          columns=['tweet_id', 'retweet_count', 'favorite_count'])

Assess Data

Programatic Assessment

twitter_archive_enhanced - data:

In [119]:
twitter_archive_enhanced.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

Issue: timestamp and retweeted_status_timestamp are of data types strings instead of datetime

In [120]:
twitter_archive_enhanced.sample(10)
Out[120]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1170 720775346191278080 NaN NaN 2016-04-15 00:46:48 +0000 <a href="http://twitter.com/download/iphone" r... This is Berkeley. He's in a predicament. 10/10... NaN NaN NaN https://twitter.com/dog_rates/status/720775346... 10 10 Berkeley None None None None
1808 676897532954456065 NaN NaN 2015-12-15 22:52:02 +0000 <a href="http://twitter.com/download/iphone" r... Exotic handheld dog here. Appears unathletic. ... NaN NaN NaN https://twitter.com/dog_rates/status/676897532... 5 10 None None None None None
1261 710153181850935296 NaN NaN 2016-03-16 17:18:07 +0000 <a href="http://twitter.com/download/iphone" r... This is Lacy. She's tipping her hat to you. Da... NaN NaN NaN https://twitter.com/dog_rates/status/710153181... 11 10 Lacy None None None None
566 802247111496568832 NaN NaN 2016-11-25 20:26:31 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Everybody drop what you're doin... 7.790561e+17 4.196984e+09 2016-09-22 20:33:42 +0000 https://twitter.com/dog_rates/status/779056095... 13 10 None None None None None
1570 687732144991551489 NaN NaN 2016-01-14 20:24:55 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Ember. That's the q-tip she owes money... NaN NaN NaN https://vine.co/v/iOuMphL5DBY 11 10 Ember None None None None
1872 675146535592706048 NaN NaN 2015-12-11 02:54:12 +0000 <a href="http://twitter.com/download/iphone" r... This is Coops. He's yelling at the carpet. Not... NaN NaN NaN https://twitter.com/dog_rates/status/675146535... 7 10 Coops None None None None
1508 691483041324204033 NaN NaN 2016-01-25 04:49:38 +0000 <a href="http://twitter.com/download/iphone" r... When bae says they can't go out but you see th... NaN NaN NaN https://twitter.com/dog_rates/status/691483041... 5 10 None None None None None
1269 709519240576036864 NaN NaN 2016-03-14 23:19:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Vince. He's a Gregorian Flapjeck. Whit... NaN NaN NaN https://twitter.com/dog_rates/status/709519240... 9 10 Vince None None None None
209 852226086759018497 NaN NaN 2017-04-12 18:25:07 +0000 <a href="http://twitter.com" rel="nofollow">Tw... Meet General. He wasn't content with the quali... NaN NaN NaN https://twitter.com/dog_rates/status/852226086... 14 10 General None None None None
1094 736736130620620800 NaN NaN 2016-05-29 01:49:16 +0000 <a href="http://twitter.com/download/iphone" r... This is Chadrick. He's gnarly af 13/10 https:/... NaN NaN NaN https://twitter.com/dog_rates/status/736736130... 13 10 Chadrick None None None None
In [121]:
twitter_archive_enhanced.name.value_counts()
Out[121]:
None         745
a             55
Charlie       12
Lucy          11
Cooper        11
Oliver        11
Penny         10
Lola          10
Tucker        10
Winston        9
Bo             9
the            8
Sadie          8
Daisy          7
Bailey         7
an             7
Buddy          7
Toby           7
Scout          6
Jack           6
Bella          6
Koda           6
Dave           6
Rusty          6
Jax            6
Oscar          6
Leo            6
Milo           6
Stanley        6
very           5
            ... 
Tonks          1
Fiji           1
Jareld         1
Gordon         1
Taco           1
Rodman         1
Rumble         1
Koko           1
Damon          1
Hermione       1
Butters        1
Timber         1
Tedders        1
Arya           1
Mabel          1
Aldrick        1
Shnuggles      1
Kevon          1
Cecil          1
Tug            1
Nimbus         1
Eugene         1
Julio          1
Amber          1
Moofasa        1
Walker         1
Rizzo          1
Kota           1
Enchilada      1
Harrison       1
Name: name, Length: 957, dtype: int64

Issue: Some dog names don't seem to be valid dog names and they either 'None' or seem to start with lowercase letters.

In [122]:
# Unreal dog names that also start with lowercase letters or labeled as 'None'
twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts()
Out[122]:
None            745
a                55
the               8
an                7
very              5
quite             4
just              4
one               4
getting           2
mad               2
not               2
actually          2
light             1
life              1
his               1
all               1
space             1
infuriating       1
this              1
by                1
unacceptable      1
incredibly        1
old               1
such              1
my                1
officially        1
Name: 0, dtype: int64
In [123]:
# I will call the list of these unreal seeming names as unreal_names
unreal_names = twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts().keys()
unreal_names
Out[123]:
Index(['None', 'a', 'the', 'an', 'very', 'quite', 'just', 'one', 'getting',
       'mad', 'not', 'actually', 'light', 'life', 'his', 'all', 'space',
       'infuriating', 'this', 'by', 'unacceptable', 'incredibly', 'old',
       'such', 'my', 'officially'],
      dtype='object')
In [124]:
# Look at values in rating_denominator 
twitter_archive_enhanced.rating_denominator.value_counts()
Out[124]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

Note: Rating denominator almost always equals 10. Here we see rating denominators that are not equal to 10. Those that are multiples of 10 may be for ratings involving multiple dogs in an image. However, rating denominator values such as 11, 2, 16, 15, 7, and certainly 0 don't make senese.

In [125]:
# Store the above unexpected denominator values in unreal_denoms
unreal_denoms = [0, 2, 7, 11, 15, 16]
In [126]:
for idx in twitter_archive_enhanced[twitter_archive_enhanced.rating_denominator.isin(unreal_denoms)].index:
    print(str(idx) + " - ", twitter_archive_enhanced.loc[idx, 'text'])
    print('\n')
313 -  @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho


342 -  @docmisterio account started on 11/15/15


516 -  Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. 
Keep Sam smiling by clicking and sharing this link:
https://t.co/98tB8y7y7t https://t.co/LouL5vdvxx


784 -  RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…


1068 -  After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ


1662 -  This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5


1663 -  I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible


2335 -  This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv


Issues: As we can see from the texts corresponding to the observations with rating_denominators: 0, 2, 7, 11, 15, and 16, ratings such as 960/00, 24/7, 9/11, 20/16, 1/2 are not real ratings at all.

  • 9/11 should be replaced by 14/10
  • 7/11 should be replaced by 10/10
  • 1/2 should be replaced by 9/10
  • Observations corresponding to the 960/00, 24/7, and 20/16 ratings should be removed
In [127]:
ser = twitter_archive_enhanced.rating_numerator.value_counts()
ser
Out[127]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
In [128]:
ser[:8].sum() / ser.sum()
Out[128]:
0.9350594227504244

Note: about 93.5% of the rating numerators range between 7 and 14. However, there are a few very large and very small values. The large numerator values could as well be ratings for a number of dogs in the same picture combined. I will see if there are outliers by divinding rating_numerator by rating_denominator after removing the observation with zero rating_denominator.

Issue: rating_numerator and rating_denominator together convey a single value, that is a rating for the dog, and hence we don't need two separate columns for dog rating in the final cleaned version of our data file (separating the two, however, is a very good idea for clean up purposes).

In [129]:
twitter_archive_enhanced[['doggo', 'puppo', 'pupper', 'floofer']].describe()
Out[129]:
doggo puppo pupper floofer
count 2356 2356 2356 2356
unique 2 2 2 2
top None None None None
freq 2259 2326 2099 2346

Note: according to THE DOGTIONARY:

  • doggo - represents a big pupper, usually older
  • pupper - represents a small doggo, usually younger, and
  • puppo - represents a transitional phase between pupper and doggo.
  • floof - represents any dog really. However, this label is commonly given to dogs with seemingly excess fur.

According to THE DOGTIONARY, floofer does not seem to be a dog 'stage'.

Issue: the twitter_archive_enhanced data frame has three columns for one feature - dog 'stage' classification. This breaks the first of the three Tidy data requirements - that each variable forms a column.


image_predictions - data:

In [131]:
image_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [132]:
image_predictions.img_num.value_counts()
Out[132]:
1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64
In [133]:
image_predictions.p1.value_counts()
Out[133]:
golden_retriever             150
Labrador_retriever           100
Pembroke                      89
Chihuahua                     83
pug                           57
chow                          44
Samoyed                       43
toy_poodle                    39
Pomeranian                    38
cocker_spaniel                30
malamute                      30
French_bulldog                26
miniature_pinscher            23
Chesapeake_Bay_retriever      23
seat_belt                     22
German_shepherd               20
Siberian_husky                20
Staffordshire_bullterrier     20
Cardigan                      19
web_site                      19
Maltese_dog                   18
Eskimo_dog                    18
beagle                        18
Shetland_sheepdog             18
teddy                         18
Shih-Tzu                      17
Lakeland_terrier              17
Rottweiler                    17
Italian_greyhound             16
kuvasz                        16
                            ... 
water_bottle                   1
hay                            1
limousine                      1
flamingo                       1
peacock                        1
scorpion                       1
leaf_beetle                    1
bookcase                       1
bakery                         1
envelope                       1
stove                          1
bison                          1
bib                            1
polecat                        1
bighorn                        1
slug                           1
African_crocodile              1
killer_whale                   1
school_bus                     1
grey_fox                       1
walking_stick                  1
electric_fan                   1
silky_terrier                  1
bookshop                       1
timber_wolf                    1
damselfly                      1
jersey                         1
clumber                        1
lacewing                       1
Japanese_spaniel               1
Name: p1, Length: 378, dtype: int64

Issue: there is an inconsistency in the way dog breeds or prediction 'things' are spelled. Some are written in all lowercase letters and some are capitalized.

In [134]:
image_predictions.p1_dog.value_counts()
Out[134]:
True     1532
False     543
Name: p1_dog, dtype: int64
In [135]:
image_predictions[['p1_conf', 'p2_conf', 'p3_conf']].describe()
Out[135]:
p1_conf p2_conf p3_conf
count 2075.000000 2.075000e+03 2.075000e+03
mean 0.594548 1.345886e-01 6.032417e-02
std 0.271174 1.006657e-01 5.090593e-02
min 0.044333 1.011300e-08 1.740170e-10
25% 0.364412 5.388625e-02 1.622240e-02
50% 0.588230 1.181810e-01 4.944380e-02
75% 0.843855 1.955655e-01 9.180755e-02
max 1.000000 4.880140e-01 2.734190e-01
In [137]:
# What is in the p1_dog column?
image_predictions[image_predictions.p1_dog].p1.value_counts()
Out[137]:
golden_retriever               150
Labrador_retriever             100
Pembroke                        89
Chihuahua                       83
pug                             57
chow                            44
Samoyed                         43
toy_poodle                      39
Pomeranian                      38
cocker_spaniel                  30
malamute                        30
French_bulldog                  26
Chesapeake_Bay_retriever        23
miniature_pinscher              23
Staffordshire_bullterrier       20
German_shepherd                 20
Siberian_husky                  20
Cardigan                        19
Maltese_dog                     18
beagle                          18
Shetland_sheepdog               18
Eskimo_dog                      18
Lakeland_terrier                17
Shih-Tzu                        17
Rottweiler                      17
kuvasz                          16
Italian_greyhound               16
West_Highland_white_terrier     14
Great_Pyrenees                  14
basset                          13
                              ... 
Rhodesian_ridgeback              4
Tibetan_terrier                  4
Saluki                           4
Afghan_hound                     4
Weimaraner                       4
curly-coated_retriever           3
Brabancon_griffon                3
Greater_Swiss_Mountain_dog       3
Ibizan_hound                     3
giant_schnauzer                  3
Leonberg                         3
briard                           3
Irish_water_spaniel              3
Scottish_deerhound               3
cairn                            3
komondor                         3
Welsh_springer_spaniel           3
wire-haired_fox_terrier          2
Sussex_spaniel                   2
toy_terrier                      2
black-and-tan_coonhound          2
Appenzeller                      2
Australian_terrier               2
silky_terrier                    1
EntleBucher                      1
clumber                          1
Scotch_terrier                   1
groenendael                      1
standard_schnauzer               1
Japanese_spaniel                 1
Name: p1, Length: 111, dtype: int64
In [138]:
# Predictions that are not dog breeds
image_predictions[~image_predictions.p1_dog].p1.value_counts()
Out[138]:
seat_belt             22
web_site              19
teddy                 18
tennis_ball            9
dingo                  9
doormat                8
bath_towel             7
hamster                7
Siamese_cat            7
tub                    7
swing                  7
car_mirror             6
ice_bear               6
llama                  6
home_theater           6
ox                     5
minivan                5
hippopotamus           5
porcupine              5
shopping_cart          5
goose                  4
patio                  4
jigsaw_puzzle          4
bathtub                4
Arctic_fox             4
brown_bear             4
wombat                 4
bow_tie                4
hog                    4
guinea_pig             4
                      ..
pitcher                1
long-horned_beetle     1
lynx                   1
sundial                1
wild_boar              1
jersey                 1
hay                    1
rapeseed               1
pedestal               1
padlock                1
polecat                1
grille                 1
peacock                1
African_crocodile      1
bakery                 1
school_bus             1
bighorn                1
cup                    1
electric_fan           1
coil                   1
convertible            1
bookshop               1
coral_reef             1
damselfly              1
hummingbird            1
lacewing               1
killer_whale           1
flamingo               1
limousine              1
handkerchief           1
Name: p1, Length: 267, dtype: int64

tweet_json - data:

In [139]:
tweet_json.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 3 columns):
tweet_id          2356 non-null int64
retweet_count     2066 non-null float64
favorite_count    2066 non-null float64
dtypes: float64(2), int64(1)
memory usage: 55.3 KB
In [140]:
tweet_json[['retweet_count', 'favorite_count']].describe()
Out[140]:
retweet_count favorite_count
count 2066.000000 2066.000000
mean 2854.412875 8502.951113
std 4878.177743 12532.050831
min 12.000000 0.000000
25% 607.000000 1624.250000
50% 1348.500000 3756.000000
75% 3314.750000 10698.750000
max 76761.000000 142432.000000
In [141]:
tweet_json.tail()
Out[141]:
tweet_id retweet_count favorite_count
2351 666020888022790149 NaN NaN
2352 666020888022790149 NaN NaN
2353 666020888022790149 NaN NaN
2354 666020888022790149 NaN NaN
2355 666020888022790149 NaN NaN

Issue: there are multiple duplicated rows


Summary of Data Quality and Tidiness Issues

Issue 1: (Quality Issue) twitter_archive_enhanced data - timestamp and retweeted_status_timestamp are of data types strings instead of datetime

Issue 2 (Quality Issue) twitter_archive_enhanced data - Some dog names don't seem to be valid dog names and they start with lowercase letters

Issue 3: (Quality Issue) twitter_archive_enhanced data - There are rating denominators that are not equal to 10. Those that are multiples of 10 may be for ratings involving multiple dogs in an image. However, rating denominator values such as 11, 2, 16, 15, 7, and certainly 0 don't make senese.

Issue 4: (Quality Issues) twitter_archive_enhanced data - about 93.5% of the rating numerators range between 7 and 14. However, there are a few very large and very small values. The large numerator values could as well be ratings for a number of dogs in the same picture combined. I will see if there are outliers by divinding rating_numerator by rating_denominator after removing the observation with zero rating_denominator.

Issue 5: (Tidiness Issue) twitter_archive_enhanced data - rating_numerator and rating_denominator together convey a single value, that is a rating for the dog, and hence we don't need two separate columns for dog rating in the final cleaned version of our data file (separating the two, however, is a very good idea for clean up purposes).

Note: according to THE DOGTIONARY:

  • doggo - represents a big pupper, usually older
  • pupper - represents a small doggo, usually younger, and
  • puppo - represents a transitional phase between pupper and doggo.
  • floof - represents any dog really. However, this label is commonly given to dogs with seemingly excess fur.

According to THE DOGTIONARY, floofer does not seem to be a dog 'stage'. This is the reason why I thought melting `floofer` along with the other dog "stage" names was not the right course of action

Issue 6: (Tidiness Issue) the twitter_archive_enhanced data frame has three columns for one feature - dog 'stage' classification. This breaks the first of the three Tidy data requirements - that each variable forms a column.

Issue 7: (Quality Issue) image_predictions data - there is an inconsistency in the way dog breeds or prediction 'things' are spelled. Some are written in all lowercase letters and some are capitalized.

Issue 8: (Tidiness Issue) tweet_json data - there are multiple duplicated rows


Clean Data

In [142]:
twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
tweet_json_clean = tweet_json.copy()

twitter_archive_enhanced - data

Issues (Quality)

Unreal dog names - they either start with lowercase letters or have the value 'None'. They are stored in unreal_names as a list.

Define

Replace all the ones that start with lowercase letters by 'None'

Code
In [143]:
unreal_names = twitter_archive_enhanced.name.str.extract('(^[a-z]+|None)', expand=True).loc[:,0].value_counts().keys()
indexes = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.name.isin(unreal_names)].index

# Replace all those unrealistic dog names by None
twitter_archive_enhanced_clean.loc[indexes, 'name'] = 'None'
Test
In [144]:
twitter_archive_enhanced_clean.name.str.extract('(^[a-z])', expand=True).loc[:,0].value_counts()
Out[144]:
Series([], Name: 0, dtype: int64)
Issue (Quality)

Unusual dog ratings: 960/00, 11/15/15, 24/7, 20/16

Define

Observations corresponding to ratings of 960/00, 11/15/15, 24/7, and 20/16 don't seem to be real WeRateDogs tweets (may be they are retweets), so remove these rows from the data.

Code
In [145]:
indexes = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([0, 7, 15, 16])].index
twitter_archive_enhanced_clean.drop(labels=indexes, axis=0, inplace=True)
Test
In [146]:
twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([0, 7, 15, 16])].shape
Out[146]:
(0, 17)
Issues (Quality)

Dog ratings of 9/11 (two of them), 7/11, and 1/2

Define

The texts corresponding to these ratings suggest that we:

  • replace 9/11 by 14/10
  • replace 7/11 by 10/10
  • replace 1/2 by 9/10
Code
In [147]:
indexes = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 9) & (twitter_archive_enhanced_clean.rating_denominator == 11)].index
twitter_archive_enhanced_clean.loc[indexes, 'rating_numerator'] = 14
twitter_archive_enhanced_clean.loc[indexes, 'rating_denominator'] = 10
In [148]:
idx = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 7) & (twitter_archive_enhanced_clean.rating_denominator == 11)].index
twitter_archive_enhanced_clean.loc[idx, 'rating_numerator'] = 10
twitter_archive_enhanced_clean.loc[idx, 'rating_denominator'] = 10
In [149]:
idx = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating_numerator == 1) & (twitter_archive_enhanced_clean.rating_denominator == 2)].index
twitter_archive_enhanced_clean.loc[idx, 'rating_numerator'] = 9
twitter_archive_enhanced_clean.loc[idx, 'rating_denominator'] = 10
Test
In [150]:
twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator.isin([2, 11])].shape
Out[150]:
(0, 17)
Issue (Tidiness)

While the rating_numerator and rating_denominator separate columns served very important roles in our clean up efforts above, they represent two columns for the same variable (i.e., dog rating) and that violates one of the three Tidy data requirements

Define

Create a new column rating by dividing rating_numerator by rating_denominator, then drop these two columns

Code
In [151]:
twitter_archive_enhanced_clean['rating'] = twitter_archive_enhanced_clean['rating_numerator'] / twitter_archive_enhanced_clean['rating_denominator']
twitter_archive_enhanced_clean.drop(labels=['rating_numerator', 'rating_denominator'], axis=1, inplace=True)
Test
In [152]:
list(twitter_archive_enhanced_clean)
Out[152]:
['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo',
 'rating']
In [153]:
twitter_archive_enhanced_clean.rating.sample(6)
Out[153]:
59      1.2
359     1.2
1136    1.2
2334    0.3
542     1.1
1199    1.1
Name: rating, dtype: float64
In [154]:
# Some stats of the variable rating = rating_numerator / rating_denominator
twitter_archive_enhanced_clean.rating.describe()
Out[154]:
count    2352.000000
mean        1.222109
std         4.085773
min         0.000000
25%         1.000000
50%         1.100000
75%         1.200000
max       177.600000
Name: rating, dtype: float64
In [155]:
twitter_archive_enhanced_clean.rating.sort_values(ascending=False)
Out[155]:
979     177.6
189      66.6
2074     42.0
188      42.0
290      18.2
340       7.5
695       7.5
763       2.7
1712      2.6
55        1.7
285       1.5
291       1.5
339       1.4
68        1.4
1068      1.4
369       1.4
41        1.4
426       1.4
758       1.4
298       1.4
149       1.4
199       1.4
372       1.4
418       1.4
195       1.4
631       1.4
56        1.4
404       1.4
78        1.4
76        1.4
        ...  
1249      0.3
2202      0.3
1928      0.3
1314      0.3
1692      0.3
2183      0.3
2136      0.3
2239      0.3
1189      0.3
2310      0.2
2246      0.2
2349      0.2
1764      0.2
2237      0.2
1761      0.2
2079      0.2
1920      0.2
1598      0.2
2326      0.2
1165      0.2
605       0.1
2338      0.1
1869      0.1
2038      0.1
2091      0.1
1940      0.1
1446      0.1
2261      0.1
1016      0.0
315       0.0
Name: rating, Length: 2352, dtype: float64
In [156]:
# Texts corresponding to very high rating values and rating values of 0
indexes = twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating == 0) | (twitter_archive_enhanced_clean.rating > 1.5)].index

for idx in indexes:
    print(str(idx) + " - ", twitter_archive_enhanced_clean.loc[idx, 'text'])
    print('\n')
55 -  @roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s


188 -  @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research


189 -  @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10


290 -  @markhoppus 182/10


315 -  When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag


340 -  RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…


695 -  This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS


763 -  This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq


979 -  This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh


1016 -  PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment https://t.co/c7WXaB2nqX


1712 -  Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD


2074 -  After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY


Issues (Quality)

As we see in the above texts, there appear to be some errors in parsing dog ratings.

  • 26/10 should be 11.26/10 = 1.126
  • 27/10 should be 11.27/10 = 1.127
  • 75/10 shoud be 9.75/10 = 0.975

Also, the first of the two 420/10 ratings does not correspond to a dog rating rather it is of Snoop Dog's.

Define
  • Replace 2.6 rating by 1.126
  • Replace 2.7 rating by 1.127
  • Replace 7.5 rating by 0.975
  • Remove the observation corresponding to Snoop Dogs's 420/10 = 42.0 rating
Code
In [157]:
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 2.6].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 1.126
In [158]:
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 2.7].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 1.127
In [159]:
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating == 7.5].index
twitter_archive_enhanced_clean.loc[idx, 'rating'] = 0.975
In [160]:
idx = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating==42.0].index[0]
twitter_archive_enhanced_clean.drop(labels=idx, axis=0, inplace=True)
Test
In [161]:
twitter_archive_enhanced_clean[(twitter_archive_enhanced_clean.rating == 0) | (twitter_archive_enhanced_clean.rating > 1.5)].rating
Out[161]:
55        1.7
189      66.6
290      18.2
315       0.0
979     177.6
1016      0.0
2074     42.0
Name: rating, dtype: float64

Notice that the ratings we wanted to replace are replaced and we leave the other outliers as they are


image_predictions - data

Issue (Quality)

Inconsistency in the way the names of prediction dogs are written - some start with capital letters and others with lowercase letters.

Define

Capitalize all names using the str.title() function

Code
In [162]:
image_predictions_clean[['p1', 'p2', 'p3']] = image_predictions_clean[['p1', 'p2', 'p3']].applymap(lambda x: x.title())
Test
In [163]:
image_predictions_clean.p1.value_counts()
Out[163]:
Golden_Retriever             150
Labrador_Retriever           100
Pembroke                      89
Chihuahua                     83
Pug                           57
Chow                          44
Samoyed                       43
Toy_Poodle                    39
Pomeranian                    38
Cocker_Spaniel                30
Malamute                      30
French_Bulldog                26
Miniature_Pinscher            23
Chesapeake_Bay_Retriever      23
Seat_Belt                     22
Staffordshire_Bullterrier     20
Siberian_Husky                20
German_Shepherd               20
Cardigan                      19
Web_Site                      19
Teddy                         18
Shetland_Sheepdog             18
Maltese_Dog                   18
Beagle                        18
Eskimo_Dog                    18
Lakeland_Terrier              17
Shih-Tzu                      17
Rottweiler                    17
Kuvasz                        16
Italian_Greyhound             16
                            ... 
Sea_Urchin                     1
Pillow                         1
Limousine                      1
Pitcher                        1
Polecat                        1
Binoculars                     1
Scorpion                       1
Lacewing                       1
Dhole                          1
Canoe                          1
Nail                           1
Stove                          1
Bow                            1
Bookshop                       1
Four-Poster                    1
Basketball                     1
Terrapin                       1
Swab                           1
Otter                          1
Madagascar_Cat                 1
Boathouse                      1
Fire_Engine                    1
Cougar                         1
Ice_Lolly                      1
Soccer_Ball                    1
Studio_Couch                   1
Agama                          1
Piggy_Bank                     1
Carton                         1
Bakery                         1
Name: p1, Length: 378, dtype: int64

tweet_json - data

In [164]:
tweet_json_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 3 columns):
tweet_id          2356 non-null int64
retweet_count     2066 non-null float64
favorite_count    2066 non-null float64
dtypes: float64(2), int64(1)
memory usage: 55.3 KB
Issue (Quality)

retweet_count and favorite_count are of dtype float and not integers

Define

Convert retweet_count and favorite_count to int data types

Code
In [165]:
tweet_json_clean[['retweet_count', 'favorite_count']] = tweet_json_clean[['retweet_count', 'favorite_count']].fillna(0).astype(int)
Test
In [166]:
tweet_json_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 3 columns):
tweet_id          2356 non-null int64
retweet_count     2356 non-null int32
favorite_count    2356 non-null int32
dtypes: int32(2), int64(1)
memory usage: 36.9 KB
In [167]:
tweet_json_clean.sample(5)
Out[167]:
tweet_id retweet_count favorite_count
1896 668989615043424256 342 702
1868 669603084620980224 387 976
614 777189768882946048 4814 15491
850 743545585370791937 1021 3748
1793 670826280409919488 4270 5749

Duplicates?

In [168]:
sum(twitter_archive_enhanced_clean.duplicated()), sum(image_predictions_clean.duplicated()), sum(tweet_json_clean.duplicated())
Out[168]:
(0, 0, 289)
Issue (Tidiness)

There are 289 duplicated observations in the tweet_json data

Define

Remove the duplicated observations in the tweet_json data

Code
In [169]:
tweet_json_clean = tweet_json_clean[~tweet_json_clean.tweet_id.duplicated()]
Test
In [170]:
sum(tweet_json_clean.duplicated())
Out[170]:
0

Merge the three data frames
In [171]:
twitter_archive_enhanced_clean.tweet_id[:3]
Out[171]:
0    892420643555336193
1    892177421306343426
2    891815181378084864
Name: tweet_id, dtype: int64
In [172]:
image_predictions_clean.tweet_id[:3]
Out[172]:
0    666020888022790149
1    666029285002620928
2    666033412701032449
Name: tweet_id, dtype: int64
In [173]:
tweet_json_clean.tweet_id[:3]
Out[173]:
0    892420643555336193
1    892177421306343426
2    891815181378084864
Name: tweet_id, dtype: int64
In [174]:
# tweet_id in the twitter_archive_enhanced_clean data appears in descending order
twitter_archive_enhanced_clean.tweet_id.equals(twitter_archive_enhanced_clean.sort_values(by='tweet_id', ascending=False)['tweet_id'])
Out[174]:
True
In [175]:
# tweet_id in the image_prediction_clean data appears in ascending order
image_predictions_clean.tweet_id.equals(image_predictions_clean.sort_values(by='tweet_id')['tweet_id'])
Out[175]:
True
In [176]:
# tweet_id in the tweet_json_clean data appears in descending order
tweet_json_clean['tweet_id'].equals(tweet_json_clean.sort_values(by='tweet_id', ascending=False)['tweet_id'])
Out[176]:
True

In the image_predictions data, tweet_id appears in increasing order, however, in the other two data sets tweet_id appears in descending order. For uniformity, I will arrange all in increasing order.

In [177]:
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.sort_values(by='tweet_id')
tweet_json_clean = tweet_json_clean.sort_values(by='tweet_id')
tweet_id is the only column that is common to the three data frames
In [178]:
ser = pd.Series(list(twitter_archive_enhanced_clean) + list(image_predictions_clean) + list(tweet_json_clean))
ser[ser.duplicated()]
Out[178]:
16    tweet_id
28    tweet_id
dtype: object
Merge the three data frames
In [180]:
twitter_archive_draft = pd.merge(tweet_json_clean, image_predictions_clean, on='tweet_id')
In [181]:
twitter_archive_draft.shape
Out[181]:
(2066, 14)
In [182]:
twitter_archive_draft = pd.merge(twitter_archive_draft, twitter_archive_enhanced_clean, on='tweet_id')
In [183]:
twitter_archive_draft.shape
Out[183]:
(2065, 29)
In [184]:
twitter_archive_draft.sample(5)
Out[184]:
tweet_id retweet_count favorite_count jpg_url img_num p1 p1_conf p1_dog p2 p2_conf ... retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls name doggo floofer pupper puppo rating
1220 744234799360020481 76761 127437 https://pbs.twimg.com/ext_tw_video_thumb/74423... 1 Labrador_Retriever 0.825333 True Ice_Bear 0.044681 ... NaN NaN NaN https://twitter.com/dog_rates/status/744234799... None doggo None None None 1.3
376 673148804208660480 662 1761 https://pbs.twimg.com/media/CVeBQwiUsAAqhLw.jpg 1 Tub 0.873010 False Bathtub 0.091434 ... NaN NaN NaN https://twitter.com/dog_rates/status/673148804... Fletcher None None pupper None 0.8
1589 798673117451325440 6226 0 https://pbs.twimg.com/media/CV_cnjHWUAADc-c.jpg 1 Dough 0.806757 False Bakery 0.027907 ... 6.755011e+17 4.196984e+09 2015-12-12 02:23:01 +0000 https://twitter.com/dog_rates/status/675501075... None None None None None 1.3
1605 800751577355128832 3108 11403 https://pbs.twimg.com/media/CxzXOyBW8AEu_Oi.jpg 2 Cocker_Spaniel 0.771984 True Miniature_Poodle 0.076653 ... NaN NaN NaN https://twitter.com/dog_rates/status/800751577... Mauve None None None None 1.2
1885 849336543269576704 2026 11960 https://pbs.twimg.com/media/C8lzFC4XcAAQxB4.jpg 1 Patio 0.521788 False Prison 0.149544 ... NaN NaN NaN https://twitter.com/dog_rates/status/849336543... None None None None None 1.1

5 rows × 29 columns

Issue (Tidiness)

twitter_archive_draft has three columns for one feature - dog 'stage' classification. This violates the first of the three Tidy data requirements - that each variable forms a column

Define

Melt the doggo, puppo, and pupper columns dog_stage_name and dog_stage columns. Then drop the dog_stage_name column because the values in three columns doggo, puppo, and pupper are the stage names themselves, the dog_stage_name column will not have any unique information.

Code
In [185]:
twitter_archive_draft.columns
Out[185]:
Index(['tweet_id', 'retweet_count', 'favorite_count', 'jpg_url', 'img_num',
       'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf',
       'p3_dog', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'name', 'doggo',
       'floofer', 'pupper', 'puppo', 'rating'],
      dtype='object')
In [186]:
twitter_archive_draft = pd.melt(twitter_archive_draft,
                               id_vars=['tweet_id','retweet_count','favorite_count','jpg_url','img_num',
                                       'p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf',
                                       'p3_dog','in_reply_to_status_id','in_reply_to_user_id','timestamp',
                                       'source','text','retweeted_status_id','retweeted_status_user_id',
                                       'retweeted_status_timestamp','expanded_urls','name','floofer','rating'],
                               value_vars=['doggo','pupper','puppo'],
                               var_name='dog_stage_name',
                               value_name='dog_stage')
twitter_archive_draft = twitter_archive_draft.drop('dog_stage_name', axis=1)
In [187]:
twitter_archive_draft.columns
Out[187]:
Index(['tweet_id', 'retweet_count', 'favorite_count', 'jpg_url', 'img_num',
       'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf',
       'p3_dog', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'name', 'floofer',
       'rating', 'dog_stage'],
      dtype='object')
In [188]:
sum(twitter_archive_draft.duplicated())
Out[188]:
3804

There are 3806 duplicated observations, and I will drop these duplicates

In [189]:
twitter_archive_draft = twitter_archive_draft.drop_duplicates()
Test
In [190]:
sum(twitter_archive_draft.duplicated())
Out[190]:
0
In [193]:
twitter_archive_draft.tail()
Out[193]:
tweet_id retweet_count favorite_count jpg_url img_num p1 p1_conf p1_dog p2 p2_conf ... source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls name floofer rating dog_stage
6087 867421006826221569 2572 16293 https://pbs.twimg.com/media/DAmyy8FXYAIH8Ty.jpg 1 Eskimo_Dog 0.616457 True Siberian_Husky 0.381330 ... <a href="http://twitter.com/download/iphone" r... This is Shikha. She just watched you drop a sk... NaN NaN NaN https://twitter.com/dog_rates/status/867421006... Shikha None 1.2 puppo
6114 874012996292530176 10516 34551 https://pbs.twimg.com/media/DCEeLxjXsAAvNSM.jpg 2 Cardigan 0.806674 True Pembroke 0.116622 ... <a href="http://twitter.com/download/iphone" r... This is Sebastian. He can't see all the colors... NaN NaN NaN https://twitter.com/dog_rates/status/874012996... Sebastian None 1.3 puppo
6131 878776093423087618 4142 19321 https://pbs.twimg.com/media/DDIKMXzW0AEibje.jpg 2 Italian_Greyhound 0.734684 True Whippet 0.150487 ... <a href="http://twitter.com/download/iphone" r... This is Snoopy. He's a proud #PrideMonthPuppo.... NaN NaN NaN https://twitter.com/dog_rates/status/878776093... Snoopy None 1.3 puppo
6180 889531135344209921 2233 15014 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg 1 Golden_Retriever 0.953442 True Labrador_Retriever 0.013834 ... <a href="http://twitter.com/download/iphone" r... This is Stuart. He's sporting his favorite fan... NaN NaN NaN https://twitter.com/dog_rates/status/889531135... Stuart None 1.3 puppo
6182 889665388333682689 10037 47869 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg 1 Pembroke 0.966327 True Cardigan 0.027356 ... <a href="http://twitter.com/download/iphone" r... Here's a puppo that seems to be on the fence a... NaN NaN NaN https://twitter.com/dog_rates/status/889665388... None None 1.3 puppo

5 rows × 27 columns

In [194]:
twitter_archive_draft[twitter_archive_draft.dog_stage=='doggo'].shape
Out[194]:
(80, 27)
In [195]:
twitter_archive_draft.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2391 entries, 0 to 6182
Data columns (total 27 columns):
tweet_id                      2391 non-null int64
retweet_count                 2391 non-null int32
favorite_count                2391 non-null int32
jpg_url                       2391 non-null object
img_num                       2391 non-null int64
p1                            2391 non-null object
p1_conf                       2391 non-null float64
p1_dog                        2391 non-null bool
p2                            2391 non-null object
p2_conf                       2391 non-null float64
p2_dog                        2391 non-null bool
p3                            2391 non-null object
p3_conf                       2391 non-null float64
p3_dog                        2391 non-null bool
in_reply_to_status_id         27 non-null float64
in_reply_to_user_id           27 non-null float64
timestamp                     2391 non-null object
source                        2391 non-null object
text                          2391 non-null object
retweeted_status_id           92 non-null float64
retweeted_status_user_id      92 non-null float64
retweeted_status_timestamp    92 non-null object
expanded_urls                 2391 non-null object
name                          2391 non-null object
floofer                       2391 non-null object
rating                        2391 non-null float64
dog_stage                     2391 non-null object
dtypes: bool(3), float64(8), int32(2), int64(2), object(12)
memory usage: 455.3+ KB
Issues (Quality Issues)
  • None values
  • timestamp, retweet_status_timestamp, floofer, and dog_stage have improper data types
Define
  • Replace all None values by np.nan
  • Convert timestamp and retweet_status_timestamp to datetime
  • Convert the floofer variable to boolean
  • Convert dog_stage to categorical data types
Codes
In [196]:
# Replace 'None' by np.nan
twitter_archive_draft = twitter_archive_draft.replace('None', np.nan)
In [197]:
# timestamp and retweet_status_timestamp to datetime
twitter_archive_draft.timestamp = pd.to_datetime(twitter_archive_draft.timestamp)
twitter_archive_draft.retweeted_status_timestamp = pd.to_datetime(twitter_archive_draft.retweeted_status_timestamp)
In [198]:
# 'floofer' to bool
twitter_archive_draft['floofer'] = twitter_archive_draft['floofer'].map(lambda x: 1 if pd.notnull(x) else 0).astype('bool')
twitter_archive_draft['floofer'].value_counts()
Out[198]:
False    2382
True        9
Name: floofer, dtype: int64
In [199]:
# dog_stage as category
twitter_archive_draft.dog_stage = twitter_archive_draft.dog_stage.astype('category')
Test
In [200]:
twitter_archive_draft.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2391 entries, 0 to 6182
Data columns (total 27 columns):
tweet_id                      2391 non-null int64
retweet_count                 2391 non-null int64
favorite_count                2391 non-null int64
jpg_url                       2391 non-null object
img_num                       2391 non-null int64
p1                            2391 non-null object
p1_conf                       2391 non-null float64
p1_dog                        2391 non-null bool
p2                            2391 non-null object
p2_conf                       2391 non-null float64
p2_dog                        2391 non-null bool
p3                            2391 non-null object
p3_conf                       2391 non-null float64
p3_dog                        2391 non-null bool
in_reply_to_status_id         27 non-null float64
in_reply_to_user_id           27 non-null float64
timestamp                     2391 non-null datetime64[ns]
source                        2391 non-null object
text                          2391 non-null object
retweeted_status_id           92 non-null float64
retweeted_status_user_id      92 non-null float64
retweeted_status_timestamp    92 non-null datetime64[ns]
expanded_urls                 2391 non-null object
name                          1574 non-null object
floofer                       2391 non-null bool
rating                        2391 non-null float64
dog_stage                     326 non-null category
dtypes: bool(4), category(1), datetime64[ns](2), float64(8), int64(4), object(8)
memory usage: 441.4+ KB

Store the `twitter_archive_draft` data frame in `twitter_archive_master.csv` CSV file

In [201]:
folder_name = 'we_rate_dogs'
filename = os.path.join(folder_name, 'twitter_archive_master.csv')
twitter_archive_draft.to_csv(filename, index=False)
Also store the `twitter_archive_draft` data frame in a SQLite database

Connect to a database
from sqlalchemy import create_engine

Create SQLAlchemy Engine and empty we_rate_dogs database
engine = create_engine('sqlite:///we_rate_dogs.db')

Store twitter_archive_draft DataFrame in a table called twitter_archive_master in we_rate_dogs.db
twitter_archive_draft.to_sql('twitter_archive_master', engine, if_exists='replace', index=False, chunksize=499)

The above sequence of codes led to this error: OperationalError: too many SQL variables

In [202]:
# Using sqlite3
import sqlite3

# Create the connection
conn = sqlite3.connect('we_rate_dogs.db')

# Convert twitter_archive_draft DataFrame to twitter_archive_master table in the databsase
twitter_archive_draft.to_sql("twitter_archive_master", conn, if_exists="replace", index=False)
Check
In [203]:
conn = sqlite3.connect('we_rate_dogs.db')
df = pd.read_sql_query('SELECT * FROM twitter_archive_master limit 3;', conn)
conn.close()

df
Out[203]:
tweet_id retweet_count favorite_count jpg_url img_num p1 p1_conf p1_dog p2 p2_conf ... source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls name floofer rating dog_stage
0 666020888022790149 514 2563 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_Springer_Spaniel 0.465074 1 Collie 0.156665 ... <a href="http://twitter.com/download/iphone" r... Here we have a Japanese Irish Setter. Lost eye... None None None https://twitter.com/dog_rates/status/666020888... None 0 0.8 None
1 666029285002620928 47 130 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 Redbone 0.506826 1 Miniature_Pinscher 0.074192 ... <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... None None None https://twitter.com/dog_rates/status/666029285... None 0 0.7 None
2 666033412701032449 44 125 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_Shepherd 0.596461 1 Malinois 0.138584 ... <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... None None None https://twitter.com/dog_rates/status/666033412... None 0 0.9 None

3 rows × 27 columns


Analyze and Visualize the data

In [204]:
twitter_archive_draft.groupby(by="dog_stage")['rating'].mean() * 10
Out[204]:
dog_stage
doggo     11.862500
pupper    10.708423
puppo     12.041667
Name: rating, dtype: float64
In [227]:
folder = './we_rate_dogs/image/'
filename = os.path.join(folder, 'dog_stages.png')
(twitter_archive_draft.groupby(by="dog_stage")['rating'].mean() * 10).plot(kind='bar', color="#CCCC99")
plt.ylabel('Average rating (out of 10)')
plt.xlabel('Dog Stage')
plt.grid(b=True, which='major', axis='both', lw=0.25)
plt.xticks(rotation=45, ha="right", color='gray')
plt.savefig(filename, bbox_inches='tight', pad_inches=0.5);

Observation:

  • The average rating for puppos (about $\frac{12.04}{10}$) is slighly higher than the average rating of doggos (about $\frac{11.86}{10}$), which in turn is higher than the average rating of puppers (about $\frac{10.71}{10}$)
In [206]:
twitter_archive_draft.name.value_counts()[:10]
Out[206]:
Cooper     13
Oliver     12
Charlie    11
Lucy       11
Bo         11
Penny      10
Tucker     10
Winston     9
Milo        8
Lola        8
Name: name, dtype: int64

Observation:

  • The top ten dog names in this data set are Cooper, Oliver, Charlie, Bo, Lucy, Penny, Tucker, Winston, Lola, and Milo
In [207]:
twitter_archive_draft.timestamp.min(), twitter_archive_draft.timestamp.max()
Out[207]:
(Timestamp('2015-11-15 22:32:08'), Timestamp('2017-08-01 16:23:56'))

The timestamp for tweets in this data set range from Nov 15, 2015 to Aug 1, 2017. To compare number of tweets by month, I will restrict timestamp to only all of 2016.

In [224]:
from datetime import datetime as dt

mask = (twitter_archive_draft.timestamp > '2015-12-31') & (twitter_archive_draft.timestamp <= '2016-12-31')
df = twitter_archive_draft.loc[mask].copy()
df['month'] = df.timestamp.dt.month
df = df[['timestamp', 'month']]
df['month'] = df['month'].apply(lambda x: dt(2016, x, 1).strftime('%b'))
In [226]:
folder = './we_rate_dogs/image/'
filename = os.path.join(folder, 'monthly_tweets.png')

plt.figure(figsize=(10,8))
df.month.value_counts().plot(kind='bar', width=0.75, color="#CCCC99")
plt.ylabel('Number of Tweets')
plt.xlabel('Month (in 2016)')
plt.grid(b=True, which='major', axis='both', lw=0.25)
plt.xticks(rotation=45, ha='right', color="gray")
plt.savefig(filename, bbox_inches='tight', pad_inches=0.2);

Observation:

  • The top three months in the number of tweets by @weratedogs in 2016 were January, March and Febraury. The least number of tweets were in April.
In [210]:
import requests
from PIL import Image
from io import BytesIO

Top six most liked tweets

In [211]:
top_idxs = twitter_archive_draft.sort_values(by='favorite_count', ascending=False).index[:10]
In [212]:
top_six_idxs = twitter_archive_draft.loc[top_idxs][~pd.Series(twitter_archive_draft.loc[top_idxs, 'jpg_url']).duplicated()].index
In [213]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[0], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[0], 'text'].split('https')[0])
i
Tweet -  Here's a super supportive puppo participating in the Toronto  #WomensMarch today. 13/10 
Out[213]:
In [214]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[1], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[1], 'text'].split('https')[0])
i
Tweet -  Here's a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad) 
Out[214]:
In [230]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[2], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[2], 'text'].split('https')[0])
i
Tweet -  This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender 
Out[230]:
In [216]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[3], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[3], 'text'].split('https')[0])
i
Tweet -  This is Stephan. He just wants to help. 13/10 such a good boy 
Out[216]:
In [217]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[4], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[4], 'text'].split('https')[0])
i
Tweet -  Here's a doggo blowing bubbles. It's downright legendary. 13/10 would watch on repeat forever (vid by Kent Duryee) 
Out[217]:
In [218]:
r = requests.get(twitter_archive_draft.loc[top_six_idxs[5], 'jpg_url'])
i = Image.open(BytesIO(r.content))
print('Tweet - ', twitter_archive_draft.loc[top_six_idxs[5], 'text'].split('https')[0])
i
Tweet -  This is Duddles. He did an attempt. 13/10 someone help him (vid by Georgia Felici) 
Out[218]: